You may need to install Cairo on your operating system to run this notebook. To learn more see the Cairo documentation at https://www.cairographics.org/download/.
if(!require(Cairo)) install.packages("Cairo", repos = "http://cran.us.r-project.org")
if(!require(caret)) install.packages("caret", repos = "http://cran.us.r-project.org")
if(!require(car)) install.packages("car", repos = "http://cran.us.r-project.org")
library(readr)
library(ggplot2)
library(knitr)
library(tidyverse)
library(caret)
library(leaps)
library(car)
library(mice)
library(scales)
library(RColorBrewer)
library(plotly)
The purpose of this project is to predict the price of houses in California in 1990 based on a number of possible location-based predictors, including latitude, longitude, and information about houses within a particular block.
While this project focuses on prediction, we want to be clear that housing prices have increased dramatically since 1990, when the data was collected. This model should not be used to predict current or future prices; it is a purely academic exercise in statistical prediction.
The goal of the project is to create the model that can best predict home prices in California given reasonable test/train splits in the data.
We’re using the California Housing Prices dataset from the following Kaggle site: https://www.kaggle.com/camnugent/california-housing-prices. This data pertains to the houses found in a given California district and some summary stats about them based on the 1990 census data.
We loaded housing.csv into R.
housing_data = read_csv("housing.csv")
## Parsed with column specification:
## cols(
## longitude = col_double(),
## latitude = col_double(),
## housing_median_age = col_double(),
## total_rooms = col_double(),
## total_bedrooms = col_double(),
## population = col_double(),
## households = col_double(),
## median_income = col_double(),
## median_house_value = col_double(),
## ocean_proximity = col_character()
## )
housing_data$median_house_value[1:100]
## [1] 452600 358500 352100 341300 342200 269700 299200 241400 226700 261100
## [11] 281500 241800 213500 191300 159200 140000 152500 155500 158700 162900
## [21] 147500 159800 113900 99700 132600 107500 93800 105500 108900 132000
## [31] 122300 115200 110400 104900 109700 97200 104500 103900 191400 176000
## [41] 155400 150000 118800 188800 184400 182300 142500 137500 187500 112500
## [51] 171900 93800 97500 104200 87500 83100 87500 85300 80300 60000
## [61] 75700 75000 86100 76100 73500 78400 84400 81300 85000 129200
## [71] 82500 95200 75000 67500 137500 177500 102100 108300 112500 131300
## [81] 162500 112500 112500 137500 118800 98200 118800 162500 137500 500001
## [91] 162500 137500 162500 187500 179200 130000 183800 125000 170000 193100
The dataset contains 20640 observations and 10 attributes (9 predictors and 1 response). Below is a list of the variables with descriptions taken from the original Kaggle site given above.
- longitude: A measure of how far west a house is; a higher value is farther west
- latitude: A measure of how far north a house is; a higher value is farther north
- housing_median_age: Median age of a house within a block; a lower number is a newer building
- total_rooms: Total number of rooms within a block
- total_bedrooms: Total number of bedrooms within a block
- population: Total number of people residing within a block
- households: Total number of households, a group of people residing within a home unit, for a block
- median_income: Median income for households within a block of houses (measured in tens of thousands of US Dollars)
- ocean_proximity: Location of the house w.r.t ocean/sea
- median_house_value: Median house value for households within a block (measured in US Dollars)

This dataset meets all of the stated criteria for the project, including:
- median_house_value
- ocean_proximity

Let’s look at a summary of each variable:
summary(housing_data)
## longitude latitude housing_median_age total_rooms
## Min. :-124.3 Min. :32.54 Min. : 1.00 Min. : 2
## 1st Qu.:-121.8 1st Qu.:33.93 1st Qu.:18.00 1st Qu.: 1448
## Median :-118.5 Median :34.26 Median :29.00 Median : 2127
## Mean :-119.6 Mean :35.63 Mean :28.64 Mean : 2636
## 3rd Qu.:-118.0 3rd Qu.:37.71 3rd Qu.:37.00 3rd Qu.: 3148
## Max. :-114.3 Max. :41.95 Max. :52.00 Max. :39320
##
## total_bedrooms population households median_income
## Min. : 1.0 Min. : 3 Min. : 1.0 Min. : 0.4999
## 1st Qu.: 296.0 1st Qu.: 787 1st Qu.: 280.0 1st Qu.: 2.5634
## Median : 435.0 Median : 1166 Median : 409.0 Median : 3.5348
## Mean : 537.9 Mean : 1425 Mean : 499.5 Mean : 3.8707
## 3rd Qu.: 647.0 3rd Qu.: 1725 3rd Qu.: 605.0 3rd Qu.: 4.7432
## Max. :6445.0 Max. :35682 Max. :6082.0 Max. :15.0001
## NA's :207
## median_house_value ocean_proximity
## Min. : 14999 Length:20640
## 1st Qu.:119600 Class :character
## Median :179700 Mode :character
## Mean :206856
## 3rd Qu.:264725
## Max. :500001
##
Note that the total_bedrooms variable has 207 NA values. We will address this issue in the Data Cleaning section in Methods.
Below is a visual representation of all data points in the dataset, with longitude on the x-axis, latitude on the y-axis, and median_house_value encoded by color.
# Note: hma, tr, tb, hh, and mi are not standard ggplot2 aesthetics; they are
# included so that ggplotly can show these variables in the hover tooltip.
plot_map = ggplot(housing_data,
aes(x = longitude, y = latitude, color = median_house_value,
hma = housing_median_age, tr = total_rooms, tb = total_bedrooms,
hh = households, mi = median_income)) +
geom_point(aes(size = population), alpha = 0.4) +
xlab("Longitude") +
ylab("Latitude") +
ggtitle("Data Map - Longitude vs Latitude and Associated Variables") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_color_distiller(palette = "Paired", labels = comma) +
labs(color = "Median House Value (in $USD)", size = "Population")
plot_map_tt = ggplotly(plot_map)
plot_map_tt
We see that on average the houses nearest to the ocean tend to have higher median house values, whereas those inland tend to have lower median values. This difference is quite substantial and tells us that the variable ocean_proximity will likely play a large role in predicting median house value.
It’s worth noting that some ocean-adjacent locations, like the more isolated towns along California’s northernmost coast, have lower median house prices than other coastal housing, and some inland locations, like the towns near destinations such as Yosemite National Park and Lake Tahoe, have higher median house prices than other inland housing. It is also interesting that the houses closest to large metropolitan areas such as San Francisco and Los Angeles have among the highest median house values, whereas the largely agricultural Central Valley has among the lowest. This tells us that we cannot rely on ocean_proximity alone to tell the whole story.
Initial exploration showed that the data needed a few cleaning steps to be usable. First, we converted the categorical variable ocean_proximity from a character vector to a factor variable.
housing_data$ocean_proximity = as.factor(housing_data$ocean_proximity)
levels(housing_data$ocean_proximity)
## [1] "<1H OCEAN" "INLAND" "ISLAND" "NEAR BAY" "NEAR OCEAN"
Taking a deeper dive into ocean_proximity, we see that the level ISLAND has a very low count compared to the other levels.
ggplot(housing_data, aes(x = factor(ocean_proximity))) +
geom_bar(stat = "count", color = "black", fill = "blue")
We looked at the total count of each level to get a better idea of the skew in ocean_proximity.
summary(housing_data$ocean_proximity)
## <1H OCEAN INLAND ISLAND NEAR BAY NEAR OCEAN
## 9136 6551 5 2290 2658
In fact, ISLAND has only five rows, whereas every other level has over 2,000. Due to possible issues with fitting models, we decided to remove this level from ocean_proximity, accepting the risk that our model will predict the value of houses on islands less accurately.
housing_data = housing_data[housing_data$ocean_proximity != "ISLAND", ]
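Note that subsetting the rows does not remove the now-empty ISLAND level from the factor itself, which is why functions such as createDataPartition can still warn about classes with no records. A minimal sketch with a toy factor (standing in for ocean_proximity) shows the fix with droplevels:

```r
# Subsetting a factor keeps all of its original levels, even empty ones
x = factor(c("INLAND", "ISLAND", "NEAR BAY"))
x = x[x != "ISLAND"]
levels(x)          # still "INLAND" "ISLAND" "NEAR BAY"
x = droplevels(x)  # drop levels with no observations
levels(x)          # "INLAND" "NEAR BAY"
```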
The other thing to consider is missing data.
sum(is.na(housing_data))
## [1] 207
total_bedrooms = housing_data$total_bedrooms
sum(is.na(total_bedrooms))
## [1] 207
There are \(207\) observations with missing data for total_bedrooms. One thing we can do to solve this issue of NA values in total_bedrooms is to impute data points. We first want to take a look at the distribution of this variable to see which imputation method will work best.
bedroom_mean = mean(housing_data$total_bedrooms, na.rm=TRUE)
bedroom_median = median(housing_data$total_bedrooms, na.rm=TRUE)
ggplot(housing_data, aes(x = total_bedrooms)) +
geom_histogram(bins = 40, color = "black", fill = "blue") +
geom_vline(aes(xintercept = bedroom_mean, color = "Mean"), lwd = 1.5) +
geom_vline(aes(xintercept = bedroom_median, color = "Median"), lwd = 1.5) +
xlab("Total Bedrooms") +
ylab("Frequency") +
ggtitle("Histogram of Total Bedrooms (noncontinuous variable)") +
scale_color_manual(name = "Summary Stats", labels = c("Mean", "Median"), values = c("red", "green"))
## Warning: Removed 207 rows containing non-finite values (stat_bin).
Looking at the distribution of the data in the histogram above, we decided to use the median of the total_bedrooms variable for our imputation method.
housing_data$total_bedrooms[is.na(housing_data$total_bedrooms)] = bedroom_median
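The imputation step above can be sketched on a toy vector (a stand-in for total_bedrooms):

```r
v = c(2, 4, NA, 6, NA)
med = median(v, na.rm = TRUE)  # median of 2, 4, 6 is 4
v[is.na(v)] = med              # replace every NA with the median
v                              # 2 4 4 6 4
sum(is.na(v))                  # 0
```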
Looking at the structure of the dataset after cleaning the data, we see that besides the one factor variable ocean_proximity, we have nine numeric variables, three of which are continuous (longitude, latitude, and median_income) and six of which are discrete (housing_median_age, total_rooms, total_bedrooms, population, households, and median_house_value).
str(housing_data)
## Classes 'tbl_df', 'tbl' and 'data.frame': 20635 obs. of 10 variables:
## $ longitude : num -122 -122 -122 -122 -122 ...
## $ latitude : num 37.9 37.9 37.9 37.9 37.9 ...
## $ housing_median_age: num 41 21 52 52 52 52 52 52 42 52 ...
## $ total_rooms : num 880 7099 1467 1274 1627 ...
## $ total_bedrooms : num 129 1106 190 235 280 ...
## $ population : num 322 2401 496 558 565 ...
## $ households : num 126 1138 177 219 259 ...
## $ median_income : num 8.33 8.3 7.26 5.64 3.85 ...
## $ median_house_value: num 452600 358500 352100 341300 342200 ...
## $ ocean_proximity : Factor w/ 5 levels "<1H OCEAN","INLAND",..: 4 4 4 4 4 4 4 4 4 4 ...
For a better sense of the distribution of the nine numeric variables, we looked at histograms for each of them.
par(mfrow = c(3, 3))
hist(housing_data$longitude, breaks = 20, main = "longitude", border="darkorange", col="dodgerblue")
hist(housing_data$latitude, breaks = 20, main = "latitude", border="darkorange", col="dodgerblue")
hist(housing_data$housing_median_age, breaks = 20, main = "housing_median_age", border="darkorange", col="dodgerblue")
hist(housing_data$total_rooms, breaks = 20, main = "total_rooms", border="darkorange", col="dodgerblue")
hist(housing_data$total_bedrooms, breaks = 20, main = "total_bedrooms", border="darkorange", col="dodgerblue")
hist(housing_data$population, breaks = 20, main = "population", border="darkorange", col="dodgerblue")
hist(housing_data$households, breaks = 20, main = "households", border="darkorange", col="dodgerblue")
hist(housing_data$median_income, breaks = 20, main = "median_income", border="darkorange", col="dodgerblue")
hist(housing_data$median_house_value, breaks = 20, main = "median_house_value", border="darkorange", col="dodgerblue")
To better understand the relationships among all the variables, we looked at a pairs plot.
pairs(housing_data, col = "dodgerblue")
It looks like some variables may be redundant due to collinearity. We looked further into the correlations between the numeric variables, shown in the table below:
housing_data_nc = housing_data[, -10]#remove text variable for now
corrmatrix = cor(housing_data_nc)
kable(t(corrmatrix))
| longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | |
|---|---|---|---|---|---|---|---|---|---|
| longitude | 1.0000000 | -0.9246761 | -0.1083936 | 0.0446418 | 0.0691635 | 0.0998814 | 0.0554001 | -0.0150904 | -0.0462078 |
| latitude | -0.9246761 | 1.0000000 | 0.0114618 | -0.0362306 | -0.0665653 | -0.1089785 | -0.0711986 | -0.0799766 | -0.1438369 |
| housing_median_age | -0.1083936 | 0.0114618 | 1.0000000 | -0.3612684 | -0.3190613 | -0.2961722 | -0.3028629 | -0.1189487 | 0.1052718 |
| total_rooms | 0.0446418 | -0.0362306 | -0.3612684 | 1.0000000 | 0.9270603 | 0.8571174 | 0.9184803 | 0.1979912 | 0.1343733 |
| total_bedrooms | 0.0691635 | -0.0665653 | -0.3190613 | 0.9270603 | 1.0000000 | 0.8735459 | 0.9743781 | -0.0076601 | 0.0495614 |
| population | 0.0998814 | -0.1089785 | -0.2961722 | 0.8571174 | 0.8735459 | 1.0000000 | 0.9072131 | 0.0047371 | -0.0244208 |
| households | 0.0554001 | -0.0711986 | -0.3028629 | 0.9184803 | 0.9743781 | 0.9072131 | 1.0000000 | 0.0129502 | 0.0660691 |
| median_income | -0.0150904 | -0.0799766 | -0.1189487 | 0.1979912 | -0.0076601 | 0.0047371 | 0.0129502 | 1.0000000 | 0.6885627 |
| median_house_value | -0.0462078 | -0.1438369 | 0.1052718 | 0.1343733 | 0.0495614 | -0.0244208 | 0.0660691 | 0.6885627 | 1.0000000 |
#highcorr = findCorrelation(corrmatrix, cutoff = .60)#this will give you highly correlated variables
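The commented-out line above refers to caret::findCorrelation, which suggests columns to drop when pairwise correlations exceed a cutoff. A small sketch on a toy correlation matrix (the variable names a, b, c are illustrative only):

```r
library(caret)

set.seed(1)
# a and b are nearly identical; c is unrelated noise
d = data.frame(a = 1:20, b = 1:20 + rnorm(20, sd = 0.01), c = rnorm(20))
m = cor(d)
findCorrelation(m, cutoff = 0.9)  # index of the column suggested for removal
```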
First we want to split the data into training and testing sets. We will use a 70/30 split of a randomized sample, with a set seed for repeatability.
Note that we decided to partition using ocean_proximity since this is a predictor that we believe will play a large role in predicting housing prices.
set.seed(420)
housing_trn_idx = createDataPartition(housing_data$ocean_proximity, p = .70, list = FALSE)
## Warning in createDataPartition(housing_data$ocean_proximity, p = 0.7, list
## = FALSE): Some classes have no records ( ISLAND ) and these will be ignored
housing_trn_data = housing_data[housing_trn_idx, ]
housing_tst_data = housing_data[-housing_trn_idx, ]
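Because createDataPartition performs a stratified split, the level proportions of the partition variable are preserved in the training set. A small sketch with a toy factor (standing in for ocean_proximity):

```r
library(caret)

set.seed(42)
g = factor(rep(c("A", "B"), times = c(70, 30)))
idx = createDataPartition(g, p = 0.70, list = FALSE)
prop.table(table(g[idx]))  # approximately the original 0.7 / 0.3 mix
```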
As a starting point we want to get an idea which parameters are good ones to use in a potential model.
We can use leaps to tell us the “best” model, which is the one with the highest Adjusted \(R^2\) for each number of parameters. Through testing we found that, given the full additive model as a starting point, this simply selects the full additive model. It is more interesting when we start from the full additive model with all two-way interactions.
regsubsets.out = regsubsets(median_house_value ~ (longitude + latitude + housing_median_age + total_rooms + population + median_income + ocean_proximity) ^ 2,
data = housing_trn_data,
nbest = 1, # 1 best model for each number of predictors
nvmax = 10, # NULL for no limit on number of variables
force.in = NULL, force.out = NULL, method = "exhaustive", really.big=T)
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax,
## force.in = force.in, : 7 linear dependencies found
## Reordering variables and trying again:
# The output of this is long and not very pretty, there is better output below.
#regsubsets.out
summary.out = summary(regsubsets.out)
dont_print_df = as.data.frame(summary.out$outmat)
## Adjusted R2
plot(regsubsets.out, scale = "adjr2", main = "Adjusted R^2")
#layout(matrix(1:2, ncol = 2))
## Adjusted R2
res.legend = subsets(regsubsets.out, statistic="adjr2", legend = FALSE, min.size = 5, main = "Adjusted R^2")
## Mallows' Cp - a small value of Cp means that the model is relatively precise
#res.legend = subsets(regsubsets.out, statistic="cp", legend = FALSE, min.size = 5, main = "Mallows' Cp")
abline(a = 1, b = 1, lty = 2)
res.legend
## Abbreviation
## longitude ln
## latitude lt
## housing_median_age hs__
## total_rooms tt_
## population pp
## median_income md_
## ocean_proximityINLAND o_IN
## ocean_proximityNEAR BAY oB
## ocean_proximityNEAR OCEAN oO
## longitude:latitude lngtd:l
## longitude:housing_median_age ln:__
## longitude:total_rooms lngtd:t_
## longitude:population lngtd:p
## longitude:median_income lngtd:m_
## longitude:ocean_proximityINLAND ln:_INLAND
## longitude:ocean_proximityNEAR BAY ln:_NEARBAY
## longitude:ocean_proximityNEAR OCEAN ln:_NEAROCEAN
## latitude:housing_median_age lt:__
## latitude:total_rooms lttd:t_
## latitude:population lt:
## latitude:median_income lttd:m_
## latitude:ocean_proximityINLAND lt:_INLAND
## latitude:ocean_proximityNEAR BAY lt:_NEARBAY
## latitude:ocean_proximityNEAR OCEAN lt:_NEAROCEAN
## housing_median_age:total_rooms hsng_mdn_g:t_
## housing_median_age:population hs__:
## housing_median_age:median_income hsng_mdn_g:m_
## housing_median_age:ocean_proximityINLAND h__:_IN
## housing_median_age:ocean_proximityNEAR BAY hB
## housing_median_age:ocean_proximityNEAR OCEAN hO
## total_rooms:population tt_:
## total_rooms:median_income tt_:_
## total_rooms:ocean_proximityINLAND t_:_IN
## total_rooms:ocean_proximityNEAR BAY tB
## total_rooms:ocean_proximityNEAR OCEAN tO
## population:median_income pp:_
## population:ocean_proximityINLAND p:_IN
## population:ocean_proximityNEAR BAY pB
## population:ocean_proximityNEAR OCEAN pO
## median_income:ocean_proximityINLAND m_:_IN
## median_income:ocean_proximityNEAR BAY mB
## median_income:ocean_proximityNEAR OCEAN mO
## ocean_proximityISLAND o_IS
## longitude:ocean_proximityISLAND ln:_ISLAND
## latitude:ocean_proximityISLAND lt:_ISLAND
## housing_median_age:ocean_proximityISLAND h__:_IS
## total_rooms:ocean_proximityISLAND t_:_IS
## population:ocean_proximityISLAND p:_IS
## median_income:ocean_proximityISLAND m_:_IS
which.max(summary.out$adjr2)
## [1] 11
summary.out$which[which.max(summary.out$adjr2),]
## (Intercept)
## TRUE
## longitude
## TRUE
## latitude
## TRUE
## housing_median_age
## FALSE
## total_rooms
## FALSE
## population
## FALSE
## median_income
## FALSE
## ocean_proximityINLAND
## FALSE
## ocean_proximityISLAND
## FALSE
## ocean_proximityNEAR BAY
## TRUE
## ocean_proximityNEAR OCEAN
## FALSE
## longitude:latitude
## TRUE
## longitude:housing_median_age
## FALSE
## longitude:total_rooms
## FALSE
## longitude:population
## FALSE
## longitude:median_income
## TRUE
## longitude:ocean_proximityINLAND
## TRUE
## longitude:ocean_proximityISLAND
## FALSE
## longitude:ocean_proximityNEAR BAY
## TRUE
## longitude:ocean_proximityNEAR OCEAN
## FALSE
## latitude:housing_median_age
## FALSE
## latitude:total_rooms
## FALSE
## latitude:population
## FALSE
## latitude:median_income
## FALSE
## latitude:ocean_proximityINLAND
## FALSE
## latitude:ocean_proximityISLAND
## FALSE
## latitude:ocean_proximityNEAR BAY
## TRUE
## latitude:ocean_proximityNEAR OCEAN
## FALSE
## housing_median_age:total_rooms
## TRUE
## housing_median_age:population
## FALSE
## housing_median_age:median_income
## FALSE
## housing_median_age:ocean_proximityINLAND
## FALSE
## housing_median_age:ocean_proximityISLAND
## FALSE
## housing_median_age:ocean_proximityNEAR BAY
## FALSE
## housing_median_age:ocean_proximityNEAR OCEAN
## FALSE
## total_rooms:population
## FALSE
## total_rooms:median_income
## TRUE
## total_rooms:ocean_proximityINLAND
## FALSE
## total_rooms:ocean_proximityISLAND
## FALSE
## total_rooms:ocean_proximityNEAR BAY
## FALSE
## total_rooms:ocean_proximityNEAR OCEAN
## FALSE
## population:median_income
## TRUE
## population:ocean_proximityINLAND
## FALSE
## population:ocean_proximityISLAND
## FALSE
## population:ocean_proximityNEAR BAY
## FALSE
## population:ocean_proximityNEAR OCEAN
## FALSE
## median_income:ocean_proximityINLAND
## FALSE
## median_income:ocean_proximityISLAND
## FALSE
## median_income:ocean_proximityNEAR BAY
## FALSE
## median_income:ocean_proximityNEAR OCEAN
## FALSE
best_adjr2 = summary.out$adjr2[which.max(summary.out$adjr2)]
From this information we see that there are three models which have an Adjusted \(R^2\) close to the maximum: \(0.6645571\). We’ll use these three models as a starting point for model selection, as well as include a fourth model that we will introduce below.
best_leaps_model_1 = lm(median_house_value ~ longitude + latitude + I(ocean_proximity == "INLAND") + I(ocean_proximity == "NEAR BAY") + longitude:latitude + longitude:median_income + longitude:I(ocean_proximity == "NEAR BAY") + housing_median_age:total_rooms + housing_median_age:population + population:median_income, data = housing_trn_data)
#summary(best_leaps_model_1)
names(coef(best_leaps_model_1))[-1]
## [1] "longitude"
## [2] "latitude"
## [3] "I(ocean_proximity == \"INLAND\")TRUE"
## [4] "I(ocean_proximity == \"NEAR BAY\")TRUE"
## [5] "longitude:latitude"
## [6] "longitude:median_income"
## [7] "longitude:I(ocean_proximity == \"NEAR BAY\")TRUE"
## [8] "housing_median_age:total_rooms"
## [9] "housing_median_age:population"
## [10] "median_income:population"
The first model to explore, best_leaps_model_1, uses the following predictors:
- longitude
- latitude
- I(ocean_proximity == "INLAND")TRUE
- I(ocean_proximity == "NEAR BAY")TRUE
- longitude:latitude
- longitude:median_income
- longitude:I(ocean_proximity == "NEAR BAY")TRUE
- housing_median_age:total_rooms
- housing_median_age:population
- median_income:population

This is an interesting set of predictors, and they make a lot of sense. As we saw in the graph in the Dataset section titled Data Map - Longitude vs Latitude and Associated Variables, housing prices appear to be influenced by large metropolitan areas such as San Francisco and Los Angeles, where longitude, latitude, and ocean_proximity would all play a large role. Also, median_income may be a proxy for the larger cities versus the more rural areas.
The second model we’ll explore, best_leaps_model_2, has an Adjusted \(R^2\) of \(0.6645571\), which ties that of best_leaps_model_1. To extract this model, we used the graph of the Adjusted \(R^2\) to read off the parameters in the second row from the top. There are similarities to the first model, but also a couple of differences.
best_leaps_model_2 = lm(median_house_value ~ housing_median_age + I(ocean_proximity == "INLAND") + I(ocean_proximity == "NEAR BAY") + longitude:housing_median_age + longitude:median_income + longitude:I(ocean_proximity == "NEAR BAY") + housing_median_age:total_rooms + housing_median_age:population, data = housing_trn_data)
summary(best_leaps_model_2)
##
## Call:
## lm(formula = median_house_value ~ housing_median_age + I(ocean_proximity ==
## "INLAND") + I(ocean_proximity == "NEAR BAY") + longitude:housing_median_age +
## longitude:median_income + longitude:I(ocean_proximity ==
## "NEAR BAY") + housing_median_age:total_rooms + housing_median_age:population,
## data = housing_trn_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -263456 -43938 -11418 29327 474940
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 7.501e+04 2.434e+03
## housing_median_age 5.528e+02 1.374e+03
## I(ocean_proximity == "INLAND")TRUE -8.331e+04 1.411e+03
## I(ocean_proximity == "NEAR BAY")TRUE -2.913e+07 1.475e+06
## housing_median_age:longitude -9.579e-02 1.147e+01
## longitude:median_income -2.811e+02 2.925e+00
## I(ocean_proximity == "NEAR BAY")TRUE:longitude -2.382e+05 1.207e+04
## housing_median_age:total_rooms 7.705e-01 2.473e-02
## housing_median_age:population -1.047e+00 4.187e-02
## t value Pr(>|t|)
## (Intercept) 30.812 <2e-16 ***
## housing_median_age 0.402 0.688
## I(ocean_proximity == "INLAND")TRUE -59.049 <2e-16 ***
## I(ocean_proximity == "NEAR BAY")TRUE -19.748 <2e-16 ***
## housing_median_age:longitude -0.008 0.993
## longitude:median_income -96.104 <2e-16 ***
## I(ocean_proximity == "NEAR BAY")TRUE:longitude -19.744 <2e-16 ***
## housing_median_age:total_rooms 31.152 <2e-16 ***
## housing_median_age:population -25.017 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 69720 on 14437 degrees of freedom
## Multiple R-squared: 0.638, Adjusted R-squared: 0.6378
## F-statistic: 3180 on 8 and 14437 DF, p-value: < 2.2e-16
names(coef(best_leaps_model_2))[-1]
## [1] "housing_median_age"
## [2] "I(ocean_proximity == \"INLAND\")TRUE"
## [3] "I(ocean_proximity == \"NEAR BAY\")TRUE"
## [4] "housing_median_age:longitude"
## [5] "longitude:median_income"
## [6] "I(ocean_proximity == \"NEAR BAY\")TRUE:longitude"
## [7] "housing_median_age:total_rooms"
## [8] "housing_median_age:population"
The best_leaps_model_2 uses the following predictors:
- housing_median_age
- I(ocean_proximity == "INLAND")TRUE
- I(ocean_proximity == "NEAR BAY")TRUE
- housing_median_age:longitude
- longitude:median_income
- I(ocean_proximity == "NEAR BAY")TRUE:longitude
- housing_median_age:total_rooms
- housing_median_age:population

We see that in this model, housing_median_age seems much more influential than in best_leaps_model_1. ocean_proximity, however, remains quite important.
The third model we’ll explore, best_leaps_model_3, also scored an adjusted \(R^2\) of \(0.6645571\).
best_leaps_model_3 = lm(median_house_value ~ median_income + I(ocean_proximity == "INLAND") + I(ocean_proximity == "NEAR BAY") + longitude:median_income + latitude:median_income + longitude:I(ocean_proximity == "NEAR BAY") + housing_median_age:total_rooms + housing_median_age:population, data = housing_trn_data)
summary(best_leaps_model_3)
##
## Call:
## lm(formula = median_house_value ~ median_income + I(ocean_proximity ==
## "INLAND") + I(ocean_proximity == "NEAR BAY") + longitude:median_income +
## latitude:median_income + longitude:I(ocean_proximity == "NEAR BAY") +
## housing_median_age:total_rooms + housing_median_age:population,
## data = housing_trn_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -276300 -41906 -12186 28334 448358
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 8.887e+04 1.925e+03
## median_income -6.143e+05 2.516e+04
## I(ocean_proximity == "INLAND")TRUE -5.354e+04 1.919e+03
## I(ocean_proximity == "NEAR BAY")TRUE -2.880e+07 1.435e+06
## median_income:longitude -7.659e+03 2.973e+02
## median_income:latitude -7.607e+03 3.090e+02
## I(ocean_proximity == "NEAR BAY")TRUE:longitude -2.356e+05 1.174e+04
## housing_median_age:total_rooms 7.987e-01 2.402e-02
## housing_median_age:population -1.046e+00 4.077e-02
## t value Pr(>|t|)
## (Intercept) 46.18 <2e-16 ***
## median_income -24.42 <2e-16 ***
## I(ocean_proximity == "INLAND")TRUE -27.90 <2e-16 ***
## I(ocean_proximity == "NEAR BAY")TRUE -20.07 <2e-16 ***
## median_income:longitude -25.76 <2e-16 ***
## median_income:latitude -24.62 <2e-16 ***
## I(ocean_proximity == "NEAR BAY")TRUE:longitude -20.07 <2e-16 ***
## housing_median_age:total_rooms 33.25 <2e-16 ***
## housing_median_age:population -25.65 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 68530 on 14437 degrees of freedom
## Multiple R-squared: 0.6502, Adjusted R-squared: 0.65
## F-statistic: 3355 on 8 and 14437 DF, p-value: < 2.2e-16
names(coef(best_leaps_model_3))[-1]
## [1] "median_income"
## [2] "I(ocean_proximity == \"INLAND\")TRUE"
## [3] "I(ocean_proximity == \"NEAR BAY\")TRUE"
## [4] "median_income:longitude"
## [5] "median_income:latitude"
## [6] "I(ocean_proximity == \"NEAR BAY\")TRUE:longitude"
## [7] "housing_median_age:total_rooms"
## [8] "housing_median_age:population"
The best_leaps_model_3 uses the following predictors:
- median_income
- I(ocean_proximity == "INLAND")TRUE
- I(ocean_proximity == "NEAR BAY")TRUE
- median_income:longitude
- median_income:latitude
- I(ocean_proximity == "NEAR BAY")TRUE:longitude
- housing_median_age:total_rooms
- housing_median_age:population

This model is very similar to best_leaps_model_2, except that median_income plays a larger role here.
We will also include a fourth model, full_twoway_model, which includes all variables in the model as well as their two-way interactions.
full_twoway_model = lm(median_house_value ~ (.)^2, data = housing_trn_data)
We can now compare the AIC of these four models.
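For reference, extractAIC returns the pair c(edf, AIC), where for a linear model AIC is computed as \(n \log(\mathrm{RSS}/n) + 2 \cdot \mathrm{edf}\). These values are comparable across models fit to the same data, though not to the output of AIC(), which includes additional constants. A quick check on the built-in cars data:

```r
fit = lm(dist ~ speed, data = cars)
n = nrow(cars)
rss = sum(resid(fit)^2)
edf = length(coef(fit))
# extractAIC's second element matches the n*log(RSS/n) + 2*edf formula
all.equal(extractAIC(fit)[2], n * log(rss / n) + 2 * edf)  # TRUE
```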
extractAIC(best_leaps_model_1)
## [1] 11.0 321471.3
extractAIC(best_leaps_model_2)
## [1] 9.0 322217.7
extractAIC(best_leaps_model_3)
## [1] 9.0 321721.9
extractAIC(full_twoway_model)
## [1] 64.0 319327.3
We see that the model with the best (i.e., lowest) AIC is full_twoway_model, with a score of \(319327.3\). However, although this model has the lowest AIC, it is also far more complex than the other three: \(64\) predictors, compared to \(11\) for best_leaps_model_1 and \(9\) each for best_leaps_model_2 and best_leaps_model_3. This is something to keep in mind as we move forward.
All four of these models are good candidates for forward and backward selection using AIC and BIC.
Note that the forward steps below are run without a scope argument. With no upper scope, forward selection has no candidate terms to add, so each forward search simply returns its starting model; the backward searches are the ones doing the actual selection.
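A forward search does add terms once it is given an explicit upper scope. As an illustration using the built-in mtcars data (not the housing models):

```r
# Start from an intercept-only model and let forward AIC add terms
# from the supplied upper scope
start_mod = lm(mpg ~ 1, data = mtcars)
fwd_mod = step(start_mod, scope = mpg ~ wt + hp + qsec,
               direction = "forward", trace = 0)
names(coef(fwd_mod))  # terms selected by the forward AIC search
```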
# full_twoway_model
# AIC
back_twoway_mod_finish_aic = step(full_twoway_model, direction = "backward", trace = 0)
forward_twoway_mod_finish_aic = step(full_twoway_model, direction = "forward", trace = 0)
# BIC
n = length(resid(full_twoway_model))
back_twoway_mod_finish_bic = step(full_twoway_model, direction = "backward", k = log(n), trace = 0)
forward_twoway_mod_finish_bic = step(full_twoway_model, direction = "forward", k = log(n), trace = 0)
# best_leaps_model_1
# AIC
back_aic_mod_1 = step(best_leaps_model_1, direction = "backward", trace = 0)
forward_aic_mod_1 = step(best_leaps_model_1, direction = "forward", trace = 0)
# BIC
n = length(resid(best_leaps_model_1))
back_bic_mod_1 = step(best_leaps_model_1, direction = "backward", k = log(n), trace = 0)
forward_bic_mod_1 = step(best_leaps_model_1, direction = "forward", k = log(n), trace = 0)
# best_leaps_model_2
# AIC
back_aic_mod_2 = step(best_leaps_model_2, direction = "backward", trace = 0)
forward_aic_mod_2 = step(best_leaps_model_2, direction = "forward", trace = 0)
# BIC
n = length(resid(best_leaps_model_2))
back_bic_mod_2 = step(best_leaps_model_2, direction = "backward", k = log(n), trace = 0)
forward_bic_mod_2 = step(best_leaps_model_2, direction = "forward", k = log(n), trace = 0)
# best_leaps_model_3
# AIC
back_aic_mod_3 = step(best_leaps_model_3, direction = "backward", trace = 0)
forward_aic_mod_3 = step(best_leaps_model_3, direction = "forward", trace = 0)
# BIC
n = length(resid(best_leaps_model_3))
back_bic_mod_3 = step(best_leaps_model_3, direction = "backward", k = log(n), trace = 0)
forward_bic_mod_3 = step(best_leaps_model_3, direction = "forward", k = log(n), trace = 0)
aic_and_bic_results = data.frame(
"AIC" =
c("Backward" =
c("Two-Way" = extractAIC(back_twoway_mod_finish_aic)[2],
"Leap M1" = extractAIC(back_aic_mod_1)[2],
"Leap M2" = extractAIC(back_aic_mod_2)[2],
"Leap M3" = extractAIC(back_aic_mod_3)[2]),
"Forward" =
c("Two-Way" = extractAIC(forward_twoway_mod_finish_aic)[2],
"Leap M1" = extractAIC(forward_aic_mod_1)[2],
"Leap M2" = extractAIC(forward_aic_mod_2)[2],
"Leap M3" = extractAIC(forward_aic_mod_3)[2])),
"BIC" =
c("Backward" =
c("Two-Way" = extractAIC(back_twoway_mod_finish_bic)[2],
"Leap M1" = extractAIC(back_bic_mod_1)[2],
"Leap M2" = extractAIC(back_bic_mod_2)[2],
"Leap M3" = extractAIC(back_bic_mod_3)[2]),
"Forward" =
c("Two-Way" = extractAIC(forward_twoway_mod_finish_bic)[2],
"Leap M1" = extractAIC(forward_bic_mod_1)[2],
"Leap M2" = extractAIC(forward_bic_mod_2)[2],
"Leap M3" = extractAIC(forward_bic_mod_3)[2])))
kable(aic_and_bic_results)
| AIC | BIC | |
|---|---|---|
| Backward.Two-Way | 319323.9 | 319333.0 |
| Backward.Leap M1 | 321471.3 | 321471.3 |
| Backward.Leap M2 | 322215.7 | 322215.7 |
| Backward.Leap M3 | 321721.9 | 321721.9 |
| Forward.Two-Way | 319327.3 | 319327.3 |
| Forward.Leap M1 | 321471.3 | 321471.3 |
| Forward.Leap M2 | 322217.7 | 322217.7 |
| Forward.Leap M3 | 321721.9 | 322217.7 |
Stepwise search did not significantly improve the exhaustive-search models we started with (`best_leaps_model_1`, `best_leaps_model_2`, and `best_leaps_model_3`), nor did it improve `full_twoway_model`, which started with all two-way interactions. However, the `back_twoway_mod_finish_aic` model did improve slightly on `full_twoway_model`, which was originally the best-performing model.
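As a side note on the mechanics used above: `step()` performs AIC-based selection by default (penalty `k = 2`), and passing `k = log(n)` turns the same search into BIC-based selection. A minimal sketch on the built-in `mtcars` data (toy data, not our housing models):

```r
fit = lm(mpg ~ ., data = mtcars)
n = nrow(mtcars)

# Default penalty k = 2 gives AIC-based backward selection.
aic_fit = step(fit, direction = "backward", trace = 0)
# k = log(n) turns the identical search into BIC-based selection.
bic_fit = step(fit, direction = "backward", k = log(n), trace = 0)

# BIC's heavier penalty selects at most as many terms as AIC.
length(coef(aic_fit))
length(coef(bic_fit))
```

The same pattern is what distinguishes the `*_aic_*` and `*_bic_*` model fits in this section.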
Despite its lower AIC, we do not carry the two-way interaction model forward: it is far more complex than the exhaustive-search models, and that added complexity makes it harder to interpret and more prone to overfitting.
We now have three models to test going forward: `best_leaps_model_1`, `back_aic_mod_2`, and `best_leaps_model_3`. Let's check the assumptions of these three models.
diagnostics = function(model, pcol = 'grey', lcol = 'dodgerblue', alpha = 0.05, plotit = TRUE, testit = TRUE) {
  if (plotit) {
    # Two plots, side-by-side
    par(mfrow = c(1, 2))
    # A fitted versus residuals plot
    plot(fitted(model), resid(model), col = pcol, pch = 20, cex = 2,
         xlab = "Fitted", ylab = "Residuals", main = "Fitted vs. Residuals")
    abline(h = 0, col = lcol, lwd = 2)
    # A Normal Q-Q plot of the residuals
    qqnorm(resid(model), col = pcol, pch = 20, cex = 2, main = "Normal Q-Q Plot")
    qqline(resid(model), col = lcol, lwd = 2)
  }
  if (testit) {
    # NOTE: shapiro.test() only accepts between 3 and 5000 observations,
    # so we test the first 5000 residuals.
    # See: https://stackoverflow.com/questions/28217306/error-in-shapiro-test-sample-size-must-be-between
    p_val = shapiro.test(resid(model)[1:5000])$p.value
    decision = ifelse(p_val < alpha, "Reject", "Fail to Reject")
    list(p_val = p_val, decision = decision)
  }
}
diagnostics(best_leaps_model_1)
## $p_val
## [1] 6.065702e-46
##
## $decision
## [1] "Reject"
diagnostics(back_aic_mod_2)
## $p_val
## [1] 4.701471e-45
##
## $decision
## [1] "Reject"
diagnostics(best_leaps_model_3)
## $p_val
## [1] 6.605337e-48
##
## $decision
## [1] "Reject"
For all three models, the Shapiro-Wilk test decisively rejects the null hypothesis that the residuals are normally distributed (p-values far below α = 0.05), and the Q-Q plots show heavy tails. The fitted-versus-residuals plots also suggest non-constant variance, a concern we address with weighted least squares later in this section.
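To complement the visual check with a formal test for non-constant variance, one option is a Breusch-Pagan-style auxiliary regression. Here is a hand-rolled sketch on simulated heteroskedastic data (toy data and names of our choosing, not the housing models):

```r
set.seed(42)
x = runif(500, 1, 10)
y = 2 + 3 * x + rnorm(500, sd = x)  # error sd grows with x
toy_mod = lm(y ~ x)

# Breusch-Pagan idea: regress the squared residuals on the fitted values.
aux = lm(resid(toy_mod) ^ 2 ~ fitted(toy_mod))
# n * R^2 of the auxiliary fit is approximately chi-squared with 1 df
# under the null of constant variance.
bp_stat = length(x) * summary(aux)$r.squared
p_val = pchisq(bp_stat, df = 1, lower.tail = FALSE)
p_val  # a small p-value indicates heteroskedasticity
```

The same test could be run on our candidate models by substituting them for `toy_mod`.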
# From the text: http://daviddalpiaz.github.io/appliedstats/variable-selection-and-model-building.html
calc_loocv_rmse = function(model) {
sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}
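This shortcut works because, for a linear model, the residual from leaving observation i out equals e_i / (1 - h_ii), where h_ii is the leverage, so no refitting is needed. As a sanity check, we can verify the shortcut against brute-force leave-one-out refitting on the built-in `mtcars` data (a toy verification, not part of the housing analysis):

```r
calc_loocv_rmse = function(model) {
  sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}

fit = lm(mpg ~ wt + hp, data = mtcars)

# Brute force: refit n times, each time leaving one observation out.
loo_errors = sapply(seq_len(nrow(mtcars)), function(i) {
  refit = lm(mpg ~ wt + hp, data = mtcars[-i, ])
  mtcars$mpg[i] - predict(refit, newdata = mtcars[i, ])
})

# The two estimates agree to machine precision.
c(shortcut = calc_loocv_rmse(fit), brute_force = sqrt(mean(loo_errors ^ 2)))
```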
calc_loocv_rmse(best_leaps_model_1)
## [1] 67979.89
calc_loocv_rmse(best_leaps_model_3)
## [1] 68573.52
calc_loocv_rmse(back_aic_mod_2)
## [1] 69751.4
The `best_leaps_model_1` model has the lowest LOOCV RMSE. Leave-one-out cross-validated RMSE estimates how well a model predicts observations it was not fitted on, so among these three candidates `best_leaps_model_1` should generalize best to unseen data.
# the actual median house values from the test set
actual = housing_tst_data$median_house_value
# predict using the best_leaps_model_1
predicted = predict(best_leaps_model_1, newdata = housing_tst_data)
(mod1_error = 100 * mean((abs(actual - predicted)) / actual))
## [1] 29.26661
# predict using the best_leaps_model_3
predicted = predict(best_leaps_model_3, newdata = housing_tst_data)
(mod2_error = 100 * mean((abs(actual - predicted)) / actual))
## [1] 28.60524
# predict using the back_aic_mod_2
predicted = predict(back_aic_mod_2, newdata = housing_tst_data)
(mod3_error = 100 * mean((abs(actual - predicted)) / actual))
## [1] 29.22784
errors = sort(c(mod1_error, mod2_error, mod3_error))
These three models have very similar mean absolute percentage errors on the test set, ranging from about 28.6% to 29.3%.
# Let's use weighted least squares (WLS) on this model to address
# potential heteroskedasticity.
back_bic_mod_fitted = fitted(back_bic_mod)
back_bic_mod_resid = resid(back_bic_mod)
# Model the log of the squared residuals as a quadratic function of the
# fitted values. Note the I(): inside a formula, ^ is formula syntax,
# so a bare back_bic_mod_fitted^2 term would be silently dropped.
temp_wls_mod = lm(log(back_bic_mod_resid ^ 2) ~ back_bic_mod_fitted + I(back_bic_mod_fitted ^ 2))
ghat = fitted(temp_wls_mod)
hhat = exp(ghat)
# Refit with weights inversely proportional to the estimated variance.
WLS_back_bic_mod = update(back_bic_mod, weights = 1 / hhat)
plot(fitted(back_bic_mod), resid(back_bic_mod),
     xlab = "Fitted", ylab = "Residuals", main = "OLS: Fitted vs. Residuals")
plot(fitted(WLS_back_bic_mod), resid(WLS_back_bic_mod),
     xlab = "Fitted", ylab = "Residuals", main = "WLS: Fitted vs. Residuals")
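The same feasible-GLS recipe can be sanity-checked on simulated data where the true variance structure is known (toy data and variable names of our choosing, not the housing models):

```r
set.seed(1)
x = runif(300, 1, 10)
y = 5 + 2 * x + rnorm(300, sd = 0.5 * x)  # true error sd proportional to x
ols = lm(y ~ x)

# Estimate the variance function from the OLS fit, as above.
aux = lm(log(resid(ols) ^ 2) ~ fitted(ols) + I(fitted(ols) ^ 2))
w = 1 / exp(fitted(aux))
fgls = lm(y ~ x, weights = w)

# Both fits recover the true slope of 2; the weighted fit should be
# more efficient because it down-weights the high-variance observations.
rbind(ols = coef(ols), fgls = coef(fgls))
```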